The Methodology of Evaluating
Social Action Programs 

By Glen G. Cain * and Robinson G. Hollister * 

Apologia 

This paper is largely motivated by our experiences as academics 
who became directly enmeshed in the problems of a public agency 
which was under considerable pressure — generated by both the 
agency staff itself and external factors — to “evaluate” manpower,
and other social action, programs.

It became evident that there were several major obstacles to 
effective evaluation in this context. These obstacles were created 
both by the several types of “actors” necessarily involved in such
evaluation efforts and by complications and weaknesses in the 
theory and methodology to be applied. Difficulties of communica- 
tion among the “actors,” due both to differences in training and to 
suspicions about motives, often made it hard to distinguish between 
difficulties arising because the theory was weak and those arising 
because adequate theory was poorly understood. 

In this paper we try to separate out some of these issues, both 
those concerning the adequacy of theory and methodology and 

* This research was supported by funds granted to the Institute for 
Research on Poverty, University of Wisconsin, pursuant to the provisions of 
the Economic Opportunity Act of 1964. Professor Cain and Professor Hollister
are associated with the University of Wisconsin Department of Economics and
are members of the Institute staff. The authors are grateful to the following
persons, who have increased their understanding of the ideas in this paper or
have commented directly on an earlier draft (or have done both): David
Bradford, Frank Cassell, John Evans, Woodrow Ginsburg, Thomas Glennan,
Robert Levine, Guy Orcutt, Gerald Somers, Ernst Stromsdorfer, Harold Watts,
Arnold Weber, Burton Weisbrod, and Walter Williams. A longer version of
this paper is available as Discussion Paper 42-69 from the Institute for
Research on Poverty, University of Wisconsin, Madison. An intermediate length
version will appear in the volume consisting of the Proceedings of the North 
American Conference on Cost-Benefit Analyses, held in Madison, Wisconsin, 
May 14-16, 1969. 
those relating to the various sorts of actors. We have sought to
couch the discussion in language that will make it available to 
academics, who we feel need a heightened awareness of the more 
practical difficulties of execution of evaluations in the social action 
context — and to public agency and political personnel, who we 
believe would benefit from increased sensitivity to the ways in 
which careful consideration of the design and careful control of 
evaluations can increase the power of the information derived 
from such efforts. The attempt to reach both audiences in one
paper produces a mixture of elements bound to strike members 
of either audience as, at some points, extremely naive and, at 
others, disturbingly recondite. We can only hope that such reac- 
tions will be transformed into a resolve to initiate a more meaning- 
ful dialogue on these issues, a dialogue we feel is crucial to the 
development of an effective approach to evaluations of social action 
programs. 

Introduction 

This paper began as a discussion of methods of evaluating man- 
power programs — programs which used to consist almost entirely 
of vocational training and various but limited types of assistance 
for the worker searching for jobs within local labor markets. But 
with the recent emphasis on problems of poverty and the disad- 
vantaged worker, manpower programs have come to involve reme- 
dial and general education, to intermesh with community action 
programs providing a variety of welfare services, and, on a trial 
basis, to assist in migration between labor markets. They are part 
of a broader class of programs which, for lack of a better term, 
we might call social action programs. Our paper will include many
references to this broader class, and in particular to anti-poverty 
programs. In so doing, we hope to provide a more general and 
more relevant perspective on the topic of evaluation methodology. 

We hold the opinion, apparently widely shared, that existing 
evaluations of social action programs (and we are including our
own) have fallen short of meeting the standards possible within
the disciplines of the social sciences. The reasons for these
shortcomings are easy to identify. The programs typically involve
investments in human beings, a relatively new area of empirical 
research in economics. They are aimed at such social and political 
goals as equality and election victories, as well as economic objec- 
tives concerning, say, income and employment. They often attempt 
to deliver services on a large enough scale to make a noticeable 
impact upon the community. And at the same time, they are 
expected to provide a quasi-experimental basis for determining 
what programs ought to be implemented and how they ought to 
be run. 

It is not surprising, then, that evaluations of social action 
programs have often not been attempted and, when attempted, have
not been successful. Despite this background, we believe that 
existing data and methods permit evaluations which, while not 
satisfying the methodological purists, can at least provide the 
rules of evidence for judging the degree to which programs have 
succeeded or failed. Specifically, the theme we will develop is
that evaluations should be set up to provide the ingredients of 
an experimental situation: a model suitable for statistical testing, 
a wide range in the values of the variables representing the pro- 
gram inputs, and the judicious use of control groups. 

The paper reflects several backgrounds in which we have had
some experience — from economics, the tradition of benefit-cost
analyses; from the other social sciences, the approach of
quasi-experimental research; and from a governmental agency, the
perspective of one initiating and using evaluation studies. Each
of these points of view has its own literature which we have by
no means covered, but to which we are indebted. 1

Types of Evaluation 

There are two broad types of evaluation. The first, which we 
call “process evaluation,” is mainly administrative monitoring. Any
program must be monitored (or evaluated) regarding the integrity
of its financial transactions and accounting system. There is
also an obvious need to check on other managerial functions, 
including whether or not accurate records are being kept. In sum, 
“process evaluation” addresses the question: Given the existence 
of the program, is it being run honestly and administered
efficiently?

A second type of evaluation, and the one with which we are 
concerned, may be called “outcome evaluation,” more familiarly
known as “cost-benefit analysis.” Although both the inputs and
outcomes of the program require measurements, the toughest 
problem is deciding on and measuring the outcomes. With this 
type of evaluation the whole concept of the program is brought 
into question, and it is certainly possible that a project might be 
judged to be a success or a failure irrespective of how well it 
was being administered. 

A useful categorization of cost-benefit evaluations draws a dis- 
tinction between a priori analyses and ex post analyses. An 
example of a priori analysis is the cost-effectiveness studies of 
weapons systems conducted by the Defense Department, which have
analyzed war situations where there were no “real outcomes” and,
thus, no ex post results with which to test the evaluation models. 
Similarly, most evaluations of water resource projects are confined 
to alternative proposals where the benefits and costs are estimated 
prior to the actual undertaking of the projects. 2 Only in the
area of social action programs such as poverty, labor training, 
and to some extent housing, have substantial attempts been made 
to evaluate programs, not just in terms of before-the-fact esti- 
mates of probable outcomes or in terms of simulated hypothetical 
outcomes, but also on the basis of data actually gathered during 
or after the operation of the program. 

A priori cost-benefit analyses of social action programs can, 
of course, be useful in program planning and feasibility studies, 
but the real demand and challenge lies in ex post evaluations. This 
more stringent demand made of social action programs may say 
something about the degree of skepticism and lack of sympathy
Congress (or “society”) has concerning these programs, but this
posture appears to be one of the facts of political life. 

Problems of the Design of the Evaluation 2A 

A. The Use of Control Groups 

Given the objective of a social action program, the evaluative 
question is: “What difference did the program make?”, and this 
question should be taken literally. We want to know the difference 
between the behavior with the program and the behavior if there 
had been no program. To answer the question, some form of con- 
trol group is essential. We need a basis for comparison — some base 
group that performs the methodological function of a control 
group. Let us consider some alternatives. 

The Before-and-After Study. In the before-and-after study,
the assumption is that each subject is his own control (or the 
aggregate is its own control) and that the behavior of the group 
before the program is a measure of performance that would have 
occurred if there had been no program. However, it is well known 
that there are many situations in which this assumption is not 
tenable. We might briefly cite some examples found in manpower 
programs. 

Sometimes the “before situation” is a point in time when the
participants are at a particularly low state — lower, that is, than
is normal for the group. The very fact of being eligible for
participation in a poverty program may reflect transitory conditions.
Under such conditions we should expect a “natural” regression
toward their mean level of performance if we measure their status
in an “after situation,” even if there were no program in the
intervening period. Using zero earnings as the permanent measure
of earnings of an unemployed person is an example of attributing
normality to a transitory status.
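
The regression effect described above can be illustrated with a small simulation. This is only a sketch under assumed numbers: the earnings distribution, the size of the transitory component, and the eligibility cutoff of 70 are all hypothetical, and no program effect of any kind is built in.

```python
import random
import statistics


def simulate_before_after(n=20000, cutoff=70.0, seed=0):
    """Mean 'before' and 'after' earnings of the eligible group when
    there is no program effect at all (all numbers are hypothetical)."""
    rng = random.Random(seed)
    before_eligible, after_eligible = [], []
    for _ in range(n):
        permanent = rng.gauss(100.0, 15.0)          # stable earnings level
        before = permanent + rng.gauss(0.0, 20.0)   # transitory shock, period 1
        after = permanent + rng.gauss(0.0, 20.0)    # independent shock, period 2
        if before < cutoff:                         # eligible only if currently low
            before_eligible.append(before)
            after_eligible.append(after)
    return statistics.mean(before_eligible), statistics.mean(after_eligible)


before_mean, after_mean = simulate_before_after()
print(f"eligible group, before: {before_mean:.1f}")
print(f"eligible group, after:  {after_mean:.1f}")
print(f"apparent gain with no program: {after_mean - before_mean:.1f}")
```

Because eligibility is defined by a low "before" reading, the selected group's earnings rise on remeasurement even though nothing was done for them; a before-and-after comparison would credit that rise to the program.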

Another similar situation arises when young people are in- 
volved in the program. Ordinary maturation and the acquisition 
of experience over the passage of time would be expected to
improve their wages and employment situation. 

There may be some structural change in the personal situa- 
tions of the participants before and after the program, which has 
nothing to do with the program but would vitiate any simple 
before-and-after comparison. We should not, for example, look upon
the relatively high earnings record of coal miners or packinghouse
workers as characteristic of their “before situation” if, in fact,
they have been permanently displaced from their jobs. 

As a final example of a situation in which the before-and-after 
comparison is invalid, there is the frequent occurrence of significant
environmental changes — particularly in labor market environments —
which are characterized by seasonal and cyclical fluctuations.
Is it the program or the changed environment which has
brought about the change in behavior? All of the above examples
of invalidated evaluations could have been at least partially
corrected if the control groups had been other similar persons
who were in similar situations in the pretraining period.
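
The correction suggested above amounts to subtracting the change experienced by similar nonparticipants from the change experienced by trainees (a calculation now often called difference-in-differences, a label not used in this paper). A minimal sketch follows; the function name and all earnings figures are hypothetical.

```python
def net_program_effect(trainee_before, trainee_after,
                       comparison_before, comparison_after):
    """Trainees' before-and-after change minus the change observed
    over the same period for similar persons who were not trained."""
    trainee_change = trainee_after - trainee_before
    comparison_change = comparison_after - comparison_before
    return trainee_change - comparison_change


# Hypothetical mean weekly earnings, for illustration only.
naive_gain = 95.0 - 60.0
adjusted_gain = net_program_effect(60.0, 95.0, 62.0, 85.0)
print(f"naive before-and-after gain: {naive_gain:.1f}")
print(f"gain net of comparison group: {adjusted_gain:.1f}")
```

The adjustment removes only those influences (maturation, seasonal and cyclical swings) that affect both groups alike; it does nothing about differences in who chooses to participate.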

Control Groups and Small Group Studies. The particular
strength of the small scale study is that it greatly facilitates the
desideratum of random assignments to “treatment groups” and
“control groups” or, at least, a closely supervised matching of
treatment and control groups. Its particular shortcoming is that
it is likely to lack representativeness — both in terms of the
characteristics of the program participants and in terms of the character
of the program. There is first the problem of a “hot house
environment” of the small group study. (See discussion of “replicability”
below.) Second, a wide range of values of the program inputs
(i.e., in terms of levels of a given treatment or in terms of
qualitatively different types of treatments) is less likely to be available
in a small group study. Third, the small group study may not
be able to detect the program's differential effects on different types
of participants (e.g., by age, sex, color, residence, etc.), either
because the wide variety of participant types is not available or
because their numbers are too small. Finally, it is both a strength
and a weakness of the small scale study that it is usually confined
to a single geographic location. Thus, although “extraneous” noise
from different environments is eliminated, we may learn little or
nothing about how the program would operate in different
environments.

Control Groups and Large Group Studies. The large scale
study, which involves gathering data over a wide range of
environments, customarily achieves “control” over the characteristics
of participants and nonparticipants and over programs and
environmental characteristics by statistical methods, rather than
by randomization or careful matching, individual by individual.
These studies have the capability of correcting each of the
shortcomings attributed to the small scale studies in the preceding
paragraph. But because they are almost impossible to operate
with randomization, the large scale studies run afoul of the
familiar problem in which the selectivity of the participants may
be associated with some unmeasured variable(s) which makes it
impossible to determine what the net effect of the treatment is.
Since this shortcoming is so serious in the minds of many analysts, 
particularly statisticians, and because the small scale studies 
have a longer history of usage and acceptability in sociology and 
psychology, it may be worthwhile to defend at greater length the 
large scale studies, which are more common to economists. 

Randomization is seldom attempted for reasons having to do 
with the attitudes of the administrators of a program, local pres- 
sures from the client population, or various logistic problems. 
Indeed, all these reasons may serve to botch an attempted randomi- 
zation procedure. Furthermore, we can say with greater certitude 
that the ideal “double-blind experiment with placebos” is almost
impossible to achieve. If we are to do something other than 
abandon evaluation efforts in the face of these obstacles to ran- 
domization, we will have to turn to the large scale study and the 
statistical design issues that go along with it. 
The fact that the programs vary across cities or among
administrators may be turned to our advantage by viewing these as
“natural experiments” 3 which may permit an extrapolation of
the results of the treatment to the “zero” or “no-treatment” level.
The analyst should work with the administrator in advance to 
design the program variability in ways which minimize the con- 
founding of results with environmental influences. Furthermore, 
ethical problems raised by deliberately excluding some persons 
from the presumed beneficial treatments are to some extent avoided
by assignments to differing treatments (although, here again, 
randomization is the ideal way to make these assignments). 
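
One way to picture the extrapolation to the “no-treatment” level is a straight-line fit of outcomes on treatment intensity across sites, with the intercept read as the estimated outcome at zero treatment. The sketch below assumes a linear relationship and uses invented city-level figures; the helper `fit_line` and the variable names are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: weeks of training offered in each of six cities and
# mean post-program weekly earnings of participants in that city.
weeks = [4, 8, 12, 16, 20, 24]
earnings = [71.0, 74.5, 80.0, 82.5, 88.0, 90.0]

intercept, slope = fit_line(weeks, earnings)
print(f"estimated earnings at zero weeks of training: {intercept:.1f}")
print(f"estimated gain per additional week: {slope:.2f}")
```

The intercept lies outside the range of observed treatments, so it is only as good as the assumed functional form, and it remains exposed to confounding if cities offering more training differ in other ways.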

It is difficult at this stage to provide more than superficial
observations regarding the choice between small-scale and large-scale
studies. It would seem that for those evaluations that have a 
design concept which is radically different from existing designs 
or where there is a quite narrow hypothesis which requires de- 
tailed examination, a small group study would be preferable. Con- 
versely, when the concept underlying a program is quite broad 
and where large amounts of resources are to be allocated, the large 
group approach is probably more relevant — a point argued in 
greater detail in our discussion of the “replicability criterion.” 

B. The Replicability Criterion 

A source of friction between administrators of programs and 
those doing evaluation research, usually academicians, is the failure 
to agree upon the level of decision-making for which the results 
of the evaluation are to be used. This failure, which is all the more 
serious because the issue is often not explicitly addressed, leads to 
disputes regarding two related issues — the scope of the evaluation 
study and the selection of variables to be studied. To deal with 
these disputes, we suggest applying the “replicability criterion.” 
We apply this name to the criterion because of the large number 
of cases in which evaluations of concepts have been made on the 
basis of projects which are not likely to be replicable on a large 
scale or which focus on characteristics of the project which are
not within the ability of the decision-makers to control. To take an
extreme example, it has sometimes been stated that the success
of a compensatory education program depended upon the “warmth
and enthusiasm” of the teachers. In the context of a nationwide
program, no administrator has control over the level of “warmth
and enthusiasm” of teachers.

It is sometimes argued by administrators that evaluations 
which are based upon samples drawn from many centers of a pro- 
gram are not legitimate tests of the program concept since they 
do not adequately take into account the differences in the details 
of individual projects or of differentiated populations. These 
attitudes frequently lead the administrators or other champions 
of the program to select, either ex ante or ex post, particular 
“pet” projects for evaluations that “really count.” In the extreme,
this approach consists of looking at the successful programs (based 
on observations of ongoing or even completed programs) and then 
claiming that these are really the ones that should be the basis 
for the evaluation of the program as a whole. If these successful 
programs have worked with representative participants in repre- 
sentative surroundings and if the techniques used — including the 
quality of the administrative and operational personnel — can be 
replicated on a nationwide basis, then it makes sense to say that 
the evaluation of the particular program can stand for an evalua- 
tion of the overall program. But we can seldom assume these 
conditional statements. After all, each of the individual programs, 
a few political plums notwithstanding, was set up because someone 
thought it was worthwhile. Of course, some will flop because of poor 
teachers or because one or more operations were fouled up — but it 
is in the nature of the beast that some incompetent administrative 
and operational foul-ups will occur. A strength of summary,
over-all measures of performance is that they will include
“accidental” foul-ups with the “accidental” successes, the few bad
administrators and teachers as well as the few charismatic leaders.
As a case in point, consider the success (according to prevailing
opinion) of Reverend Sullivan's Opportunities Industrialization Center
in Philadelphia alongside the absence, as yet, of any evidence that the
OIC idea has been successfully transferred elsewhere. 4

Small scale studies of pre-selected particular programs are 
most useful either for assessing radically different program ideas 
or for providing the administrator with information relevant to 
decisions of program content within the confines of his overall 
program. These are important uses, but the decisions at a broader 
level which concern the allocation of resources among programs of 
widely differing concepts call for a different type of evaluation
with a focus on different variables. 

It may be helpful to cite an example of the way in which the 
replicability criterion should have been applied. A few years ago, 
a broad scale evaluation of the Work Experience Program 5 was
carried out. (The evaluation was of necessity based upon very 
fragmentary data, but we are here concerned with the issues it 
raised rather than with its own merits.) The evaluation indicated 
that on the average the unemployment rates among the completers
of the program were just as high as those of persons with similar
characteristics who had not been in the program. On the basis of this
evaluation, it was argued that the concept of the program was
faulty, and some rather major shifts in the design and in the
allocation of resources to the program were advocated. 6 Other
analysts objected to this rather drastic conclusion and argued that
the “proper” evaluative procedure was to examine individual
projects within the program, pick out those projects which had
higher “success rates,” and then attempt to determine which
characteristics of these projects were related to those “success
rates.” 7

The argument as to which approach is proper depends on the 
particular decision framework to which the evaluation results 
were to be applied. To the administrators of the program, it is 
really the project by project type of analysis which is relevant 
to the decision variables which they control. The broader type of
evaluation would be of interest, but their primary concern is to 
adjust the mix of program elements to obtain the best results 
within the given broad concept of the program. Even for program 
administrators, however, there will be elements and personnel 
peculiar to a given area or project that will not be replicable in 
other areas and other projects. 

For decision-makers at levels higher than the program adminis- 
trator the broader type of evaluation will provide the sort of 
information relevant to their decision frame. Their task is to 
allocate resources among programs based upon different broad 
concepts. Negative findings from the broader evaluation argue 
against increasing the allocation to the program, although a con- 
servative response might be to hold the line on the program while 
awaiting the more detailed project-by-project evaluation to deter- 
mine whether there is something salvageable in the concept em- 
bodied in the program. There will always be alternative programs 
serving the same population, however, and the decision-maker is
justified in shifting resources toward those programs which hold 
out the promise of better results. 

The basic point is that project-by-project evaluations are bound 
to turn up some “successful” project somewhere, but unless there
is good evidence that that “success” can be broadly replicated and
that the administrative controls are adequate to insure such
replication, then the individual project success is irrelevant. Resources
must be allocated in light of evidence that concepts are not only 
“successful” on a priori grounds or in particular small-scale 
contexts but that they are in fact “successful” in large-scale 
implementation. 
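
The point that some “successful” project is bound to turn up can be checked with a simulation in which the program has no effect anywhere. The sketch below is illustrative only: the number of projects, the group sizes, and the outcome variability are assumed values.

```python
import random
import statistics


def project_estimates(n_projects=50, n_per_group=40, true_effect=0.0, seed=1):
    """Estimated effect (treated mean minus control mean) for each project."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_projects):
        treated = [rng.gauss(true_effect, 10.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 10.0) for _ in range(n_per_group)]
        estimates.append(statistics.mean(treated) - statistics.mean(control))
    return estimates


estimates = project_estimates()
overall = statistics.mean(estimates)   # summary measure across all projects
best = max(estimates)                  # the hand-picked "success"
print(f"average estimated effect over all projects: {overall:.2f}")
print(f"estimated effect in the best-looking project: {best:.2f}")
```

With a true effect of zero, the summary measure stays near zero while the best-looking project shows a sizable apparent gain produced by chance alone, which is why a project selected after the fact cannot stand in for the program as a whole.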

C. The Theoretical Framework — Some Statistical 
Considerations. 

The main function of a theoretical framework in cost-benefit 
evaluations is to provide a statistical model suitable for testing. 
In this section a few brief remarks will be made about the
statistical design of the evaluation — a lengthier discussion of these
matters is taken up in another paper. 7A In these remarks we will
adopt the terminology of regression analysis, which is a statistical
method flexible enough to handle an analysis of variance approach
or that involved in simply working with cell values in tables. In
the regression model, the dependent variable is the objective of
the social action program and the particular set of independent
variables of most interest to us are those that describe or represent
the program, or program inputs. In this discussion the
independent variables will sometimes be referred to as “treatment
variables.”

It may be useful to divide the problems of statistical design 
into two categories: First, attaining acceptable levels of statistical 
significance on the measured effects of the treatment variables; 
second, measuring those effects without bias. We will not discuss
the first problem here except to note that the failure to attain
statistical significance of the effect of the treatment variable occurs
either because of large unexplained variation in the dependent
variable or small effects of treatment variables, and these can be
overcome with sufficiently large sample sizes. In our opinion, the
most serious defect in evaluation studies is bias in the measures of
effects of the treatment variables, and this error is unlikely to be
removed by enlarging the sample size.
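
A small simulation can show why a larger sample does not cure bias. In this sketch, participation is self-selected on an unmeasured trait (labeled "motivation" here purely for illustration) that also raises the outcome; the true program effect is set to 5, and every number is an assumption made for the example.

```python
import random


def naive_effect(n, seed=2):
    """Participant mean minus nonparticipant mean, ignoring motivation."""
    rng = random.Random(seed)
    sums = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        motivation = rng.gauss(0.0, 1.0)                        # not observed
        participates = motivation + rng.gauss(0.0, 1.0) > 0.0   # self-selection
        outcome = (5.0 * participates          # true program effect is 5
                   + 10.0 * motivation         # motivation also raises outcome
                   + rng.gauss(0.0, 10.0))     # unexplained variation
        sums[participates] += outcome
        counts[participates] += 1
    return sums[True] / counts[True] - sums[False] / counts[False]


for n in (200, 20000, 200000):
    print(f"n = {n:>6}: naive estimate = {naive_effect(n):.2f} (true effect 5.00)")
```

As the sample grows, the estimate settles down, but it settles on a value well above the true effect: precision improves while the bias from selection remains.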

One source of bias is inaccurate measures of the treatment 
variable, but a more pervasive and more serious problem is the 
presence of variables, not included in the statistical model, which 
are correlated with both the dependent variable and the treatment 
variable. Had the assignment to a program been made on a random 
basis, the laws of probability would have assured a low correlation 
(zero in the limit of a large enough sample size) between partici- 
pation in the program and these omitted variables. In the absence 
of randomization, we must fall back on statistical controls. At 
this point our theory and a priori information are crucially important [...]

[...] of independent variables is included, greater efficiency is obtained
when the treatment variable is uncorrelated with the other inde- 
pendent variables. In the opposite extreme, if the treatment 
variables were perfectly correlated with some other variable or 
combination of variables, we would be unable to distinguish 
between which of the two sets of factors caused a change. It 
follows that even in the absence of randomization, designing the 
programs to be studied with as wide a range in levels and types 
of “treatments” as possible will serve to maximize the information
we can extract from an ex post analysis. 
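
The efficiency argument can be made concrete with the standard textbook expression for the sampling variance of a coefficient in a linear regression with two independent variables: the error variance divided by the spread of the treatment variable times (1 - r^2), where r is the correlation between the treatment variable and the other variable. This formula is a standard result brought in for illustration, not one stated in the paper, and the numbers below are hypothetical.

```python
def treatment_coef_variance(error_variance, treatment_levels, r):
    """Sampling variance of the treatment coefficient in a two-regressor
    linear model, where r is the correlation between the treatment
    variable and the other independent variable."""
    if not -1.0 < r < 1.0:
        raise ValueError("with perfect correlation the effect cannot be separated")
    mean_t = sum(treatment_levels) / len(treatment_levels)
    spread = sum((t - mean_t) ** 2 for t in treatment_levels)
    return error_variance / (spread * (1.0 - r * r))


narrow = [9, 10, 11, 10, 9, 11]   # hypothetical: nearly identical treatment levels
wide = [0, 5, 10, 15, 20, 10]     # hypothetical: wide range of treatment levels

print(treatment_coef_variance(100.0, narrow, 0.0))
print(treatment_coef_variance(100.0, wide, 0.0))
print(treatment_coef_variance(100.0, wide, 0.9))
```

A wider range of treatment levels shrinks the variance, a higher correlation with other variables inflates it, and at perfect correlation the function refuses to return an answer, which mirrors the statement that the two sets of factors could not then be distinguished.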

There are reasons in addition to those of statistical efficiency 
for planning for a wide range of values in the treatment or
programmatic variables. One is that social action programs have a
tendency to change, rather frequently and radically, during the 
course of their operation. Evaluations designed to test a single 
type of program are rendered meaningless because the program- 
type perishes. But if the design covers a wider variety of pro- 
grams, then a built-in hedge against the effects of change is 
attained. Indeed, there is an even more fundamental reason why 
a wide range of inputs and program types should be planned for, 
and it is simply this: we seldom know enough about what will 
work in a social action program to justify putting our eggs in 
the single basket of one type of program. This evaluation model 
for a single type of project, sometimes described as the analogue of 
the "pilot plant," is not the appropriate model for social action 
programs given our current state of knowledge.® 

D. The Theoretical Framework — Some Economic
Considerations.

For operational purposes we will assume that the evaluation of 
each social action program can, at least in principle, be cast in the 
statistical model discussed in the previous section, complete with 
variables representing an objective of the program, treatment
variables representing the program inputs, control variables, and
control groups. 10 However, the substantive theoretical content of
these models — the particular selection of variables and their
functional form — must come from one or more of the traditional
disciplines such as educational psychology (e.g., for Head Start),
demography (e.g., for a family planning program), medical science 
(e.g., for a neighborhood health center), economics (e.g., for a man- 
power training program), and so on. 

Sooner or later economics must enter all evaluations, since 
“costing out” the programs and the setting of implicit or explicit
dollar measures of the worth of a program are essential steps in a 
complete evaluation. In making the required cost-benefit analysis, 
the part of economic theory that applies is the investment theory 
of public finance economics, with its infusion of welfare economics. 
The function of investment theory is to make commensurable in- 
puts and outcomes of a social action program which are spaced 
over time. 10A Welfare economics analyzes the distinctions between
financial costs and real resource costs, between direct effects of a 
program and externalities, and between efficiency criteria and 
equity (or distributional) criteria. 
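
Making inputs and outcomes commensurable over time is usually done by discounting each year's amount to a present value. The sketch below shows the arithmetic only; the cost and benefit streams and the 5 percent discount rate are hypothetical and are not drawn from any program discussed here.

```python
def present_value(flows, rate):
    """Present value of yearly amounts, where flows[0] occurs now,
    flows[1] one year from now, and so on."""
    return sum(amount / (1.0 + rate) ** year for year, amount in enumerate(flows))


# Hypothetical per-trainee figures, for illustration only.
costs = [3000.0, 500.0]                               # outlays now and next year
benefits = [0.0, 800.0, 800.0, 800.0, 800.0, 800.0]   # earnings gains, years 1-5
rate = 0.05

pv_costs = present_value(costs, rate)
pv_benefits = present_value(benefits, rate)
print(f"present value of costs:    {pv_costs:.2f}")
print(f"present value of benefits: {pv_benefits:.2f}")
print(f"net present value:         {pv_benefits - pv_costs:.2f}")
```

The result is sensitive to the discount rate and to how many years of benefits are assumed to persist, both of which are judgments the analyst has to defend separately.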

We will say very little on the last mentioned distributional or 
equity question of who pays and who receives, even though we
strongly feel that accurate data on the distribution of benefits and
costs is essential to an evaluation of social action programs.
However, the task of conducting a “conventional” benefit-cost analysis
(where the criterion is allocative efficiency) is sufficiently complex 
that we believe it preferable to separate the distributional questions. 

Program Inputs. In the investment theory model costs are
attached to all inputs of a program and a single number emerges 
which measures the present value of the resources used. Most of 
the technical problems faced by the analysts on the input side are 
those of traditional cost accounting. We will confine our remarks 
to the two familiar and somewhat controversial problems of op- 
portunity costs and transfer payments, which arise in nearly every 
manpower program. Both of these problems are most effectively 
dealt with if one starts by asking: “What is the decision context
for which these input measures are defined?”

The most general decision context — and the one to which
economists most naturally refer — is that of the productivity of
alternative resource utilizations in society or the nation as a whole.
In this case, one wishes to measure the cost of inputs in terms of 
the net reduction in value of alternative socially productive activi- 
ties caused by the use of the inputs in this particular activity. 
Now, the value of most inputs in terms of their alternative use 
will be more or less clearly indicated by their market price, but 
there are some inputs for which this will not be true. The most 
troublesome cases often concern the time of people. A well known 
example is the value of the time spent by students in school: since
those over 14 or so could be in the job market, the social product 
(or national income) is less; therefore, an estimate is needed of 
what their earnings would be had they not been in school. (Such 
an estimate should reflect whatever amount of unemployment 
would be considered “normal.”) For manpower programs the
best evaluation design would provide a control group to measure 
the opportunity costs of the time spent by the trainees in the 
program. 

Sometimes the prices of inputs (market prices or prices fixed 
by the government) do not adequately reflect their marginal social 
productivity, and “corrected” or “shadow prices” are necessary.
For example, the ostensible prices of leisure or of the housework of
a wife are zero and obviously below their real price. By contrast,
a governmental fixed price of some surplus commodity is too high.

The definition and treatment of transfer payments also depend 
on the decision context of the analysis. From the national perspec- 
tive money outlays from the budget of one program that are offset 
by reduced outlays elsewhere in society do not decrease the value 
of the social product. When these outlays are in the form of cash
payments or consumption goods, they are called transfer payments. 
An example is the provision of room and board for Job Corps
[...]ments for the cost of room and board to a Job Corpsman, which
was considered a transfer payment above, would now be considered 
an input cost from the "taxpayer's viewpoint." The fact that the
trainee or his family is relieved of this burden would be of no in- 
terest since it would not be reflected in the public budget. However, 
if the costs of room and board had been met previously by a public 
welfare agency, then from the "taxpayer's viewpoint," the costs
would not be charged to the Job Corps program. 
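
The dependence of input costs on the decision context can be made concrete with a short Python sketch. The `Outlay` record, the two context labels, and the dollar amounts are hypothetical and serve only to restate the rules described above; they are not taken from any actual program budget.

```python
from dataclasses import dataclass


@dataclass
class Outlay:
    label: str
    amount: float
    is_transfer: bool = False        # cash or consumption goods offset elsewhere in society
    previously_public: bool = False  # cost was already met by another public agency


def input_cost(outlays, context):
    """Total input cost of a program under a given decision context."""
    if context == "national":
        # Transfers do not reduce the social product, so they are not costs.
        return sum(o.amount for o in outlays if not o.is_transfer)
    if context == "taxpayer":
        # Every budget outlay counts unless another public budget already bore it.
        return sum(o.amount for o in outlays if not o.previously_public)
    raise ValueError(f"unknown decision context: {context}")


# Hypothetical per-trainee outlays, for illustration only.
outlays = [
    Outlay("instruction", 1000.0),
    Outlay("room and board", 800.0, is_transfer=True),
]
print(input_cost(outlays, "national"))  # room and board excluded
print(input_cost(outlays, "taxpayer"))  # room and board included
```

Setting `previously_public=True` on the room and board outlay reproduces the last case in the text: the cost then drops out of the taxpayer's accounting as well.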

It is not uncommon to see several decision contexts used in one 
analysis, and used inconsistently. For example, the post-training
earnings improvements from participation in a Job Corps program
are considered benefits. We all recognize, of course, that the
earnings will be used mostly for consumption by the Job Corps
graduate. But in the same study, his consumption during training (room,
meals, and spending allowance) is not viewed as conferring benefits
on the corpsman.¹² Or is it that the benefits should not count
because, while in training, he is not considered a member of "our
society"? We leave this puzzle to those who prefer these restricted
decision contexts. There are other such examples and still other 
and more narrow decision contexts, such as that of a local govern- 
ment or of the project by itself. But it is probably clear that our 
preference is for the national or total societal perspective. 

Program Outcomes. The problems of measurement on the out- 
come side of the evaluation problem are tougher to handle, and ex 
post evaluations of social action programs face particular problems 
because these outcomes are likely to involve behavioral relationships 
which are not well understood. It is particularly difficult to predict 
long run or permanent behavioral changes from the short run in- 
dicators revealed by the on-going or just completed program. 

The outcomes we wish to measure from many social action pro- 
grams occur months or years after the participants have completed 
the program. We can use proxy measures, which can themselves be 
measured during and soon after the program, but follow-up studies 
are clearly preferred and may in many cases be essential. A good
deal depends on the confidence we have in the power of our theories
to link the proxies or short-run effects (e.g., test scores, health
treatments, employment experience in the short run, etc.) with
the longer run goals (longer run educational attainment, longevity,
incomes, or all of these and perhaps other "softer" measures of
"well-being"). It is a role for "basic research" in the social sciences
to provide this type of theoretical-empirical information to
evaluations, but we can also hope that the more thorough evaluation
studies will contribute to our stock of "basic research" findings.

The major obstacle to follow-up measures is the difficulty in 
locating people, particularly those from disadvantaged populations 
who may be less responsive and who have irregular living patterns. 
The biases due to nonresponse may be severe, since those
participants who are easiest to locate are likely to be the most "successful,"
both because of their apparent stability and because those who have
"failed" may well be less responsive to requests to reveal their
current status. One way around the costly problem of tracking down
respondents for earnings data is to use Social Security records for 
participant and control groups. The rights of confidentiality may 
be preserved by aggregating the data. 
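
The direction of the nonresponse bias can be shown with a small arithmetic sketch in Python. The two groups, their earnings, and the numbers located at follow-up are invented for illustration; the only point is that when the "successful" are easier to find, the observed mean overstates the true mean.

```python
def true_and_observed_mean(groups):
    """Compare the true mean outcome with the mean among those located.

    groups is a list of (n_participants, mean_earnings, n_located) tuples.
    """
    total_n = sum(n for n, _, _ in groups)
    true_mean = sum(n * m for n, m, _ in groups) / total_n

    total_located = sum(k for _, _, k in groups)
    observed_mean = sum(k * m for _, m, k in groups) / total_located
    return true_mean, observed_mean


# Hypothetical follow-up: 45 of 50 "successful" participants are located,
# but only 25 of 50 "unsuccessful" ones.
groups = [
    (50, 4000.0, 45),  # higher earners, easy to locate
    (50, 2000.0, 25),  # lower earners, hard to locate
]
true_mean, observed_mean = true_and_observed_mean(groups)
print(true_mean, round(observed_mean, 2))
```

With these figures the follow-up sample shows mean earnings several hundred dollars above the true mean, although no participant's earnings have changed.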

Another problem in measuring outcomes, which also tends to 
be more talked about despairingly than coped with positively, is the 
category of external or third-party effects of the program. As a 
typical illustration consider a youth training program, which not 
only increases the earnings of the youths, but also reduces the
incidence of crime among these groups, which generally benefits the
community (e.g., less damage and lower costs of prevention and
rehabilitation programs). Another source of third-party effects is
the benefits accruing to the participant's family members, including those
yet to be born. It is an open question, however, whether the prob- 
lem for concern is the lack of measurement of these external effects, 
or the tendency by administrators and others (particularly friends 
of the programs) to exaggerate their likely importance and to
[...]vide for third-party individuals in the community. Thus, we are not
proposing that the "community" be viewed as an "entity" separate
from the individuals who comprise it. However, a separate focus
on measures of community institutional changes appears necessary
since the present state of our theories of community organization
permits us little scope for anything except qualitative linkages
between institutional changes and their effects on individuals in the
community. We can, for example, consider better communication 
between the neighborhood populace and the police, school officials, 
or the employment service as "good things," either in their own
right, as expressions of the democratic ethic, or because we believe 
that such changes will have tangible effects in safety, school achieve- 
ment or better jobs. 



Intentional Experiments: A Suggested Strategy

Underlying the growing interest in evaluations of social action 
programs is the enlightened idea that the scientific method can be
applied to program experience to establish and measure particular 
cause and effect relationships which are amenable to change through 
the agents of public policy. However, traditional methods in science, 
whether the laboratory experimentation of the physical scientists, 
the testing of pilot models by engineers, or field testing of drugs
by medical scientists, are seldom models that can be directly copied, 
helpful though they are as standards of rigor. 

In particular, evaluation designs patterned after the testing of 
pilot models, which correspond to "demonstration projects" in the 
field of social action programs, have been inadequate for both 
theoretical and operational reasons. The present state of our theories 
of social behavior does not justify settling on a unique plan of 
action, and we cannot, almost by definition, learn much about alter- 
native courses of action from a single pilot project. It is somewhat 
paradoxical that on the operational level the pilot model has failed 
to give us much information because the design has frequently 
been impossible to control and has spun off in different directions.
The combination of, first, loose administration of and rapid
changes in the operation of individual projects and, second, a large
scale program with many heterogeneous projects (different
administrations, different environments, different clientele, etc.) has
led to the interesting view that this heterogeneity creates what are,
in effect, "natural experiments" for an evaluation design. For
economists, who are used to thinking of the measurement of consumers'
responses to changes in the price of wheat or investors' responses
to changes in the interest rate, the idea of "natural experiments"
has a certain appeal. But what should be clear from this discussion
(and others before us have reached the same conclusion) is
that a greatly improved evaluation could be obtained if social action
programs were initiated as intentional experiments.

When one talks of "experiments" in the social sciences what
inevitably comes to mind is a small scale, carefully controlled 
study, such as those traditionally employed in psychology. Thus, 
when one suggests that social action programs be initiated as inten- 
tional experiments, people imagine a process which would involve 
a series of small test projects, a period of delay while those pro- 
jects are completed and evaluated, and perhaps more retesting 
before any major program is mounted. This is very definitely not 
what we mean when we suggest social action programs as intentional
experimentation. We would stress the word action to highlight
the difference between what we suggest and the traditional
small scale experimentation.

Social action programs are undertaken because there is a clearly 
perceived social problem that requires some form of amelioration. 
In general (with the exception perhaps of the area of medicinal
drugs, where a counter tradition has been carefully or painfully
built up), we are not willing to postpone large scale attempts at
amelioration of such problems until all the steps of a careful testing
of hypotheses, development of pilot projects, etc. have been carried
out. We would suggest that large scale ameliorative social action
and intentional experimentation are not incompatible; experimental
designs can be built into a large scale social action program.

If a commitment is made to a more frankly experimental social 
action program by decision-makers and administrators, then many 
of the objectives we have advocated can be addressed directly at 
the planning stage. If we begin a large national program with a 
frank awareness that we do not know which program concept is
most likely to be efficacious, then several program models
could be selected for implementation in several areas, with enough
variability in the key elements which make up the concepts to
allow good measures of the differential responses to those elements.
If social action programs are approached with an "intentionally
experimental" point of view, then the analytical powers of our
statistical models of evaluation can be greatly enhanced by attempts
to ensure that "confounding" effects are minimized, i.e., that
program treatment variables are uncorrelated with participant
characteristics and particular types of environments.
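
One way to keep program treatments uncorrelated with types of environment is to randomize the assignment of program models to sites within each environment type. The Python sketch below is a hypothetical illustration of that idea, not a design proposed in the text: the site labels, the environment types, and the function `assign_models` are all assumptions.

```python
import random
from collections import defaultdict


def assign_models(sites, models, seed=0):
    """Randomly assign program models to sites within each environment type.

    sites is a list of (site_id, environment_type) pairs. Within each
    environment type the sites are shuffled and the models are dealt out
    in rotation, so every model appears about equally often in every
    type of environment.
    """
    rng = random.Random(seed)
    by_env = defaultdict(list)
    for site_id, env in sites:
        by_env[env].append(site_id)

    assignment = {}
    for env in sorted(by_env):
        site_ids = by_env[env][:]
        rng.shuffle(site_ids)
        order = list(models)
        rng.shuffle(order)
        for i, site_id in enumerate(site_ids):
            assignment[site_id] = order[i % len(order)]
    return assignment


# Hypothetical national program: 6 urban and 6 rural sites, 3 program models.
sites = [(f"U{i}", "urban") for i in range(6)] + [(f"R{i}", "rural") for i in range(6)]
models = ["model A", "model B", "model C"]
plan = assign_models(sites, models, seed=1)
for site_id in sorted(plan):
    print(site_id, plan[site_id])
```

Because the balance is built into the assignment, differences in outcomes between models cannot be attributed to one model having been placed mainly in urban or mainly in rural sites. The same stratification could be applied to participant characteristics.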

A less technical but equally important gain from this approach 
to social action programs is the understanding on the part of ad- 
ministrators, decision-makers, and legislators that if we are to 
learn anything from experience it is necessary to hold the design 
of the program (that is, the designed project differentials in treat- 
ment variables) constant for a long enough period of time to allow 
for the "settling down" of the program and the collection and 
analysis of the data. A commitment to hold to design for a long
enough period so that we could learn from experience is a central 
element in the experimental approach to social action. 

The idea that social action programs should be experimental 
is simple, but we cannot be sanguine about the speed with which 
the full implications of this simple idea will be accepted by de- 
cision-makers and the public as a whole. The view that programs 
can be large scale action programs and still be designed as inten- 
tional experiments has not been easy to get across, even to those 
trained in experimental methods in the social sciences, with their
tradition of small scale research.

The emphasis on ex post evaluation is evidence of the fact that
at some level legislators understand that social action programs are
"testing" concepts. But it will require more explicit acceptance of
the idea that some aspects of programs "tested" in action will fail
before the full advantages of the intentionally experimental
approach can be realized. It takes restraint to mount a program with
a built-in experimental design and wait for it to mature before
deciding on a single program concept, but we emphasize that restraint
does not mean small scale or limited action.

It is not unfair, we think, to characterize the approach to social 
action programs that has been taken in the past as one of serial 
experimentation through program failure. A program is built 
around a single concept, eventually it is realized that it does not 
work, so the program is scrapped (or allowed to fade away) and 
a new program and concept are tried. Certainly serial
experimentation through failure is the hard way to learn. An intentionally
experimental approach would allow us to learn faster by trying 
alternative concepts simultaneously and would make it more likely 
that we could determine not only that a particular concept failed, 
but also why it failed. 

The Acceptability of Evaluation Results 

It does little violence to the facts to state that few decisions 
about social action programs have been made on the basis of the 
types of evaluations we have been discussing thus far in this paper. 
A major reason for this, we feel, is an inadequate taste for rigor 
(or an overweening penchant for visceral judgments) among administrators
and legislators and an excessive taste for purely scientific
standards among academics. It often seems that the scholars conspire
with the legislators to beat down any attempt to bring to bear more 
orderly evidence about the effectiveness of alternative programs; it 
is not at all difficult to find experts who will testify that virtually 
any evaluation study is not adequately "scientific" to provide a
sound basis for making program decisions. There is a reasonable
and appropriate fear on the part of academics that sophisticated
techniques of analysis will be used as deceptive wrapping around 
an essentially political kernel to mislead administrators or the 
public. This fear, however, often leads to the setting of standards
of "proof" which cannot be satisfied at present, given the state of
the art of the social sciences, and perhaps never can be, given the
inherent nature of social action programs. The result generally is
that the evaluation is discredited, the information it provides is
ignored, and the decision-maker and legislator can resume the
exercise of their visceral talents.

A first step toward creating a more favorable atmosphere for 
evaluation studies is to recognize that they will not be final arbiters 
of the worth of a program. A positive but more modest role for 
evaluation research was recently stated by Kenneth Arrow in a 
discussion of the relative virtues of the traditional processes of
public decision-making (characterized as an adversary process)
and the recently developed procedure of the Programming,
Planning, Budgeting System (characterized as a rationalistic or
"synoptic" process).¹⁶ Arrow advocated an approach in between
forensics and synoptics.¹⁷ He illustrated his argument by making
an analogy with the court system, suggesting that what was
happening through the introduction of the more rationalistic processes
was the creation of a body of "rules of evidence." The use of
systematic evaluation (along with the other elements of the PPBS)
represents an attempt to raise the standards of what is admissible 
as evidence in a decision process that is inherently likely to remain 
adversary in nature. Higher standards of evaluation will lessen 
the role of "hearsay" testimony in the decision process, but they 
are not meant to provide a hard and fast decision rule in and of 
themselves. The public decision-making process is still a long way 
from the point at which the evidence from a hard evaluation is the 
primary or even the significant factor in the totality of factors 
which determine major decisions about programs. Therefore, the 
fear of many academics that poorly understood evaluations will
exercise an inordinate influence on public decisions is, to say the
least, extremely premature. But if standards for the acceptance 
of evaluation results are viewed in terms of the "rules of evidence"
analogy, we can begin to move toward the judicious mix of rigor
and pragmatism that is so badly needed in evaluation analysis.

The predominant view of the role of "serious," independent
evaluations¹⁸ (particularly in the eyes of harried administrators)
seems to be that of a trial (to continue the analogy) aimed at find- 
ing a program guilty of failure. There is a sense in which this para- 
noid view of evaluation is correct. The statistical procedures used 
usually start with a null hypothesis of “no effect,” and the burden 
of the analysis is to provide evidence that is sufficiently strong to 
overturn the null hypothesis. As we have pointed out, however, 
problems of data, organization, and methods conspire to make clear- 
cut positive findings in evaluations difficult to demonstrate. 

The atmosphere for evaluations would be much healthier if the 
underlying stance were shifted from this old world juridical rule. 
Let the program be assumed innocent of failure until proven guilty 
through clear-cut negative findings. In more precise terms, we 
should try to avoid committing what are called in statistical theory 
Type II errors. Thus, an evaluation which does not permit rejecting 
the null hypothesis (of a zero effect of the program) at customary 
levels of statistical significance, may be consistent with a finding 
that a very large positive effect may be just as likely as a zero or 
negative effect.¹⁹ "Rules of evidence" which emphasize the
avoidance of Type II errors are equivalent to an attitude which we have
characterized as "innocent until proven guilty." (We must frankly 
admit that, like court rules of evidence, this basic stance may pro- 
vide incentives to the program administrators to provide data which 
are sufficient only for arriving at a "no conclusion" evaluative 
outcome.) 
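
The statistical point can be illustrated with a short Python sketch. It assumes, beyond anything stated in the text, that the estimated program effect is approximately normally distributed with a known standard error; the estimate of 300 and the standard error of 200 are hypothetical.

```python
from statistics import NormalDist


def summarize(estimate, std_error, alpha=0.05):
    """Two-sided test of a zero effect, with the matching confidence interval."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = estimate / std_error
    interval = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return {"z": z, "reject_null": abs(z) > z_crit, "interval": interval}


def relative_support(estimate, std_error, effect_a, effect_b):
    """Likelihood of the estimate under effect_a relative to effect_b."""
    sampling_error = NormalDist(mu=0.0, sigma=std_error)
    return sampling_error.pdf(estimate - effect_a) / sampling_error.pdf(estimate - effect_b)


# Hypothetical evaluation: estimated earnings gain of 300 with standard error 200.
result = summarize(300.0, 200.0)
print(result["reject_null"])                     # the null of zero effect is not rejected
print(result["interval"])                        # interval spans zero and large gains
print(relative_support(300.0, 200.0, 600.0, 0.0))  # a gain of 600 versus no gain
```

Here the null hypothesis survives at the 5 percent level, yet a true effect of 600 is exactly as compatible with the data as a true effect of zero. Reading the non-rejection as proof of failure would risk a Type II error.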

As a final conciliatory comment: when we talk about evaluation
studies leading to verdicts of "success" or "failure," it should be 
recognized that we are greatly simplifying and abbreviating the
typical results. Most social action programs are so complex in the
variety of inputs and the multiplicity of objectives, that simple 
over-all judgments are not likely to lead to quick decisions to dump 
programs. In combination with more detailed studies, the purpose 
of the evidence provided by the analysts will instead usually be to 
suggest modifications in the program — to shift the composition of 
inputs, perhaps to re-emphasize some objectives and de-emphasize 
others — and to suggest marginal additions or subtractions in the 
total scale of the program. It is worth emphasizing these modest 
objectives because the trust and cooperation of program
administrators are indispensable to an evaluation of the program.

[...]jective judgments, and that, in any case, the objective portion is necessary to an
improved over-all judgment that spans both measurable and unmeasurable inputs and
outputs of a program.

We bypass here the important question of the choice of a discount rate. Some
discussion of this issue is provided in the Institute for Research on Poverty version of this
paper.

¹¹ When the program produces an increase in consumption of goods and services, the
treatment of these transfer payments can become more complicated if we do not assume
that the goods and services have a value to the recipients equal to their cost. See A. A.
Alchian and W. R. Allen, University Economics (Wadsworth: Belmont, California, 1967,
Second Edition) pp. 186-140 for an extended discussion.

¹² For just one of many examples of this type of treatment of transfer payments
see "The Feasibility of Benefit-Cost Analysis in the War on Poverty: A Test Application
to Manpower Programs," prepared for the General Accounting Office, Resource
Management Corporation, UR-054, December 18, 1968.

¹³ For a notable exception to the absence of attempted measurement of the type
of third-party effects discussed above, see Thomas I. Ribich, Education and Poverty
(Washington, D.C.: The Brookings Institution, 1966). Ribich's study also gives us some evidence
of the likelihood of relatively small quantitative magnitudes of these effects. A rather
free-wheeling listing of third-party effects runs the risk of double counting benefits. For
example, although other family members benefit from the better education or earnings of the
head of the household, we should not forget that had the investment expenditure been
elsewhere, even if in the form of an across-the-board tax cut, other family heads would
have had larger incomes, at least, with resulting benefits to their families. In his
examination of cost-benefit analysis of water resources developments, Roland N. McKean
gives an extended discussion of the pitfalls of double counting. See his Efficiency in
Government Through Systems Analysis (New York: John Wiley and Sons, Inc., 1958),
especially Chapter 9.

¹⁴ An exceptionally good discussion of negative external effects, including disruption
of the community structure, is contained in Anthony Downs, "Uncompensated
Non-Construction Costs Which Urban Highways and Urban Renewal Impose on Residential
Households," which will appear in a Universities-National Bureau of Economic Research
Conference volume entitled Economics of Public Output. The literature on urban renewal
and public housing is extensive and too well known to require listing here.

¹⁵ For an excellent discussion of many of these issues see Joel F. Handler,
"Controlling Official Behavior in Welfare Administration," The Law of the Poor, ed., J.
tenBroek (Chandler Publishing Co., 1966). (Also published in The California Law
Review, Vol. 64, 1966, p. 479.)

¹⁶ For a more complete discussion of this terminology, see Henry Rowen, "Recent
Developments in the Measurement of Public Outputs," to be published in a
Universities-National Bureau of Economic Research Conference volume, The Economics of Public
Output.

¹⁷ Remarks by Kenneth Arrow during the NBER conference cited in the previous
footnote.

¹⁸ We mean here to exclude the quick and casual sort of evaluations, mainly
"in-house" evaluations, that more often than not are meant to provide a gloss of technical
justification for a program.

¹⁹ Harold Watts has stressed this point in conversations with the authors. See Glen
G. Cain and Harold W. Watts, "The Controversy about the Coleman Report: Comment,"
Journal of Human Resources, Vol. III, No. 6, Summer, 1968, pp. 889-92; also, Harold
W. Watts and David L. Horner, "The Educational Benefits of Head Start: A
Quantitative Analysis," Discussion Paper Series, The Institute for Research on Poverty, University
of Wisconsin, Madison, Wisconsin.